Collectives for the Optimal Combination of Imperfect Objects 

Kagan Tumer and David Wolpert
NASA Ames Research Center, Moffett Field, CA, 94035, USA
ktumer@mail.arc.nasa.gov dhw@email.arc.nasa.gov

In this letter we summarize some recent theoretical work on the design of collectives, i.e., of 
systems containing many agents, each of which can be viewed as trying to maximize an associated 
private utility, where there is also a world utility rating the behavior of that overall system that 
the designer of the collective wishes to optimize. We then apply algorithms based on that work to
a recently suggested testbed for such optimization problems [1]. This is the problem of finding the
combination of imperfect nano-scale objects that results in the best aggregate object. We present 
experimental results showing that these algorithms outperform conventional methods by more than 
an order of magnitude in this domain. 

I. MOTIVATION AND BACKGROUND

The optimization problem of finding the subset of a set 
of imperfect devices that results in the best aggregate
device was recently introduced by Challet and Johnson [1].
It is an abstraction of what will likely be a major dif- 
ficulty in the construction of systems using nano-scale 
components, arising when a large fraction of those com- 
ponents may be faulty, so that we cannot use a scheme 
fixed ahead of time for interconnecting them. This ab- 
straction is a computationally hard optimization prob- 
lem; brute force approaches cannot be used for large in- 
stances of it. It is also particularly well-suited to testing 
distributed optimization algorithms. 

We propose addressing this problem by associating 
each device with an adaptive reinforcement-learning (RL) 
agent that decides whether or not its device will be 
a member of the subset. It makes this decision based 
on its estimate of which choice will give a larger value 
of its associated private utility function, which maps
the joint choice of all agents into the reals. For such an 
approach to work, we must both ensure that the agents 
do not work at cross-purposes, and that each one has 
a tractable learning problem to solve. Typically these 
two desiderata conflict with one another (utilities that 
account for whether your action is at cross-purposes to 
any other agents' actions are very complex and therefore 
difficult to optimize). So we must find the best way to
trade them off against each other.

A collective is any such system containing utility- 
maximizing agents, together with an overall world util- 
ity function that rates the possible configurations of 
the overall system. The associated design problem is 
how to configure the collective — and in particular how 
to set the private utility functions — to best optimize 
the world utility. This design problem is related to 
work in many other fields, including multi-agent systems 
(MAS's), computational economics, mechanism design, 
reinforcement learning, statistical mechanics, computa- 
tional ecologies, (partially observable) Markov decision 
processes and game theory. However, none of these fields
is both applicable in large problems, and directly ad- 
dresses the general design problem, rather than a spe- 
cial instance of it. (See for a detailed discussion of 
the relationship between these fields, involving hundreds 
of references.) In particular, since detailed modeling of 
extremely large real-world systems is usually impossible,
it is crucial to use design algorithms that do not
employ such modeling. We need to carry out our design
leveraging only the simple assumption that the agents'
learning algorithms are individually reasonably well behaved.
Recently some advances have been made in doing just
that. It is those advances that we propose to
apply to the problem of Challet and Johnson.



II. THE MATHEMATICS OF DESIGNING 
COLLECTIVES 

Let $\zeta$ be an arbitrary space whose elements $z$ give the
joint move of all agents in the collective system. We wish
to search for the $z$ that maximizes the provided world
utility $G(z)$. In addition to $G$ we are concerned with
private utility functions $\{g_\eta\}$, one such function for each
agent $\eta$ controlling $z_\eta$. We use the notation $\hat{\eta}$ to refer to
all agents other than $\eta$.

Our uncertainty concerning the state of the system is 
reflected in a probability distribution over $\zeta$. Our ability
to control the system consists of setting the value of
some characteristic of the collective, e.g., setting the pri- 
vate utility functions of the agents. Indicating that value 
of the global coordinate by s, our analysis revolves 
around the following central equation for P(G | s), 
which follows from Bayes' theorem: 



$$P(G \mid s) = \int d\vec{N}_G \, P(G \mid \vec{N}_G, s) \int d\vec{N}_g \, P(\vec{N}_G \mid \vec{N}_g, s) \, P(\vec{N}_g \mid s), \qquad (1)$$



where $\vec{N}_g$ and $\vec{N}_G$ are the intelligence vectors of the
agents with respect to their private utilities $\{g_\eta\}$ and $G$,
respectively. Intelligence, defined as the "standardization" of
utility functions so that the numeric value they assign to a $z$ only
reflects their ranking of $z$ relative to certain other elements
of $\zeta$, is given by:

$$N_{\eta,U}(z) \equiv \int d\mu_{z_{\hat{\eta}}}(z') \, \Theta[U(z) - U(z')], \qquad (2)$$

where $\Theta$ is the Heaviside function, and where the subscript
on the (normalized) measure $d\mu$ indicates it is restricted
to $z'$ sharing the same non-$\eta$ components as $z$.

Note that $N_{\eta,g_\eta}(z) = 1$ means that agent $\eta$ is fully
rational at $z$, in that its move maximizes the value of its
utility, given the moves of the other agents. In other words,
a point $z$ where $N_{\eta,g_\eta}(z) = 1$ for all agents $\eta$ is one
that meets the definition of a game-theory Nash equilibrium.
On the other hand, a $z$ at which all components
of $\vec{N}_G$ equal 1 is a maximum of $G$ along all coordinates of $z$
(which of course does not mean it is a maximum along
an off-axis direction). So if we can get these two points
to be identical, then if the agents do well enough at maximizing
their private utilities we are assured we will be
near an axis-maximizing point for $G$.
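
As an illustration of Equation 2, the following minimal Python sketch computes a discrete analogue of intelligence and uses it to test for a Nash equilibrium. It assumes a finite move set per agent, a uniform measure over that agent's moves, and the convention $\Theta(0) = 1$; the names `intelligence`, `is_nash`, and `toy_utility` are hypothetical and not taken from the original work.

```python
def intelligence(utility, z, agent, moves=(0, 1)):
    """Discrete analogue of Equation 2 for one agent.

    Returns the fraction of the agent's possible moves z' (all other
    agents' moves held fixed) whose utility does not exceed U(z).
    Assumptions: uniform measure over `moves`, Theta(0) = 1.
    """
    u_z = utility(z)
    alternatives = [z[:agent] + (m,) + z[agent + 1:] for m in moves]
    return sum(u_z >= utility(alt) for alt in alternatives) / len(moves)


def is_nash(private_utilities, z, moves=(0, 1)):
    """A joint move is a Nash equilibrium when every agent has intelligence 1."""
    return all(
        intelligence(g, z, agent, moves) == 1.0
        for agent, g in enumerate(private_utilities)
    )


def toy_utility(z):
    """Hypothetical utility: best when exactly one agent plays 1."""
    return -(sum(z) - 1) ** 2


print(intelligence(toy_utility, (1, 0, 0), 0))
print(is_nash([toy_utility] * 3, (1, 0, 0)))
```

Joint moves are represented as tuples so that a single component can be swapped without mutating the original.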

To formalize this, consider our decomposition of $P(G \mid s)$.
If we can choose $s$ so that the third conditional probability
in the integrand is peaked around vectors $\vec{N}_g$ all
of whose components are close to 1, then we have likely
induced large (private utility function) intelligences.
Intuitively, this ensures that the private utility functions
have high "signal-to-noise". If we can also have the second
term be peaked about $\vec{N}_G$ equal to $\vec{N}_g$, then $\vec{N}_G$ will
also be large. It is in the second term that the requirement
that the private utility functions be "aligned with
$G$" arises. Note that our desired form for the second term
in Equation 1 is assured if we have chosen private utilities
such that $\vec{N}_g$ equals $\vec{N}_G$ exactly for all $z$. Such a system
is said to be factored. Finally, if the first term in the
integrand is peaked about high $G$ when $\vec{N}_G$ is large, then
our choice of $s$ will likely result in high $G$, as desired. In
this letter we concentrate on the second and third terms,
and show how to simultaneously set them to have the
desired form.

As an example, any "team game" in which all private
utility functions equal $G$ is factored. However, team
games often have very poor forms for term 3 in Equation 1,
forms which get progressively worse as the size
of the collective grows. This is because for such private
utility functions each agent $\eta$ will usually confront a very
poor "signal-to-noise" ratio in trying to discern how its
actions affect its utility $g_\eta = G$, since so many other
agents' actions also affect $G$ and therefore dilute $\eta$'s
effect on its own private utility function.
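
The factoredness condition (private and world intelligences agreeing at every $z$) can be checked by brute force on a small system. The sketch below is only illustrative: it assumes binary moves, a uniform measure, and $\Theta(0) = 1$, and the toy world utility and the "selfish" private utility are hypothetical examples rather than anything used in the experiments.

```python
from itertools import product


def intelligence(utility, z, agent, moves=(0, 1)):
    """Fraction of the agent's moves that do no better than its move in z."""
    u_z = utility(z, agent)
    alternatives = [z[:agent] + (m,) + z[agent + 1:] for m in moves]
    return sum(u_z >= utility(alt, agent) for alt in alternatives) / len(moves)


def is_factored(private, world, n_agents, moves=(0, 1)):
    """True if every agent's private intelligence equals its world intelligence
    at every joint move (exhaustive check, feasible only for tiny systems)."""
    for z in product(moves, repeat=n_agents):
        for agent in range(n_agents):
            if intelligence(private, z, agent, moves) != intelligence(world, z, agent, moves):
                return False
    return True


def world(z, agent=None):
    """Hypothetical world utility: best when exactly one agent plays 1."""
    return -(sum(z) - 1) ** 2


def team_game(z, agent):
    """Team game: every agent's private utility is G itself."""
    return world(z)


def selfish(z, agent):
    """Hypothetical misaligned utility: each agent simply prefers to play 1."""
    return z[agent]


print(is_factored(team_game, world, 3))
print(is_factored(selfish, world, 3))
```

The exhaustive loop visits every joint move, so its cost grows exponentially with the number of agents; it is meant only to make the definition concrete.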

We now focus on algorithms based on private utility
functions $\{g_\eta\}$ that optimize the signal/noise ratio
reflected in the third term, subject to the requirement that
the system be factored. To understand how these algorithms
work, say we are given an arbitrary function $f(z_\eta)$
over agent $\eta$'s moves, two such moves $z_\eta^1$ and $z_\eta^2$, a
utility $U$, a value $s$ of the global coordinate, and a move
by all agents other than $\eta$, $z_{\hat{\eta}}$. Define the associated
learnability by

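$$\Lambda_f(U; z_{\hat{\eta}}, s, z_\eta^1, z_\eta^2) \equiv \sqrt{\frac{[E(U; z_{\hat{\eta}}, z_\eta^1) - E(U; z_{\hat{\eta}}, z_\eta^2)]^2}{\int dz_\eta \, f(z_\eta) \, \mathrm{Var}(U; z_{\hat{\eta}}, z_\eta)}}. \qquad (3)$$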

The expectation values in the numerator are formed by
averaging over the training set of the learning algorithm
used by agent $\eta$, $n_\eta$. Those two averages are evaluated
according to the two distributions $P(U \mid n_\eta) P(n_\eta \mid z_{\hat{\eta}}, z_\eta^1)$
and $P(U \mid n_\eta) P(n_\eta \mid z_{\hat{\eta}}, z_\eta^2)$, respectively. (That is the
meaning of the semicolon notation.) Similarly the variance
being averaged in the denominator is over $n_\eta$
according to the distribution $P(U \mid n_\eta) P(n_\eta \mid z_{\hat{\eta}}, z_\eta)$.

The denominator in Equation 3 reflects how sensitive
$U(z)$ is to changing $z_{\hat{\eta}}$. In contrast, the numerator
reflects how sensitive $U(z)$ is to changing $z_\eta$. So the greater
the learnability of a private utility function $g_\eta$, the more
$g_\eta(z)$ depends only on the move of agent $\eta$, i.e., the better
the associated signal-to-noise ratio for $\eta$. Intuitively
then, so long as it does not come at the expense of decreasing
the signal, increasing the signal-to-noise ratio
specified in the learnability will make it easier for $\eta$ to
achieve a large value of its intelligence. This can be established
formally: if appropriately scaled, $g'_\eta$ will result in
better expected intelligence for agent $\eta$ than will $g_\eta$ whenever
$\Lambda_f(g'_\eta; z_{\hat{\eta}}, s, z_\eta^1, z_\eta^2) > \Lambda_f(g_\eta; z_{\hat{\eta}}, s, z_\eta^1, z_\eta^2)$ for
all pairs of moves $z_\eta^1, z_\eta^2$.
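
To make the signal-to-noise intuition concrete, here is a simplified Python proxy for learnability. It is not Equation 3 itself: instead of averaging over an agent's training set, it assumes the other agents move uniformly at random over binary moves and enumerates their joint moves exactly, with $f$ taken as uniform. The linear world utility, the weights `A`, and all function names are hypothetical.

```python
from itertools import product
from math import sqrt, inf


def signal_to_noise(utility, agent, n_agents):
    """Proxy for learnability: |difference of mean utilities between the
    agent's two moves| divided by the root of the move-averaged variance
    caused by the other agents' (uniformly random) moves."""
    others = list(product((0, 1), repeat=n_agents - 1))
    means, variances = [], []
    for move in (0, 1):
        values = []
        for rest in others:
            z = rest[:agent] + (move,) + rest[agent:]  # insert agent's move
            values.append(utility(z, agent))
        mean = sum(values) / len(values)
        means.append(mean)
        variances.append(sum((v - mean) ** 2 for v in values) / len(values))
    signal = (means[1] - means[0]) ** 2
    noise = sum(variances) / len(variances)
    if noise == 0:
        return inf if signal > 0 else 0.0
    return sqrt(signal / noise)


A = (1.0, 2.0, 2.0)  # hypothetical per-agent weights


def team_game(z, agent):
    """Linear toy world utility used directly as the private utility."""
    return sum(a * zj for a, zj in zip(A, z))


def difference_utility(z, agent):
    """World utility minus its value with the agent's move clamped to 0."""
    clamped = z[:agent] + (0,) + z[agent + 1:]
    return team_game(z, agent) - team_game(clamped, agent)


print(signal_to_noise(team_game, 0, 3))
print(signal_to_noise(difference_utility, 0, 3))
```

In this linear toy case the difference utility removes all dependence on the other agents, so the proxy diverges, while the team game's ratio is limited by the other agents' contribution to the variance.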

One can solve for the set of all private utilities that 
are factored with respect to a particular world utility. 
Unfortunately though, in general a collective cannot both
be factored and have infinite learnability for all of its
agents. However, consider difference utilities, of the
form

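$$U(z) = \beta[G(z) - D(z_{\hat{\eta}})]. \qquad (4)$$

Any difference utility is factored. In addition, for all
pairs $z_\eta^1, z_\eta^2$, under benign approximations the difference
utility maximizing $\Lambda_f(U; z_{\hat{\eta}}, s, z_\eta^1, z_\eta^2)$ is given by

$$D(z_{\hat{\eta}}) = E_f(G(z) \mid z_{\hat{\eta}}, s), \qquad (5)$$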

up to an overall additive constant, where the expectation
value is over $z_\eta$. We call the resultant difference utility
the Aristocrat utility (AU), loosely reflecting the fact
that it measures the difference between an agent's actual
action and the average action. If each agent $\eta$ uses an
appropriately rescaled version of the associated AU as its
private utility function, then we have ensured good form
for both terms 2 and 3 in Equation 1.
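
A minimal Python sketch of the Aristocrat utility for an agent with a finite move set is given below. It assumes the weighting $f$ is uniform over the agent's moves unless other weights are supplied, and it omits the scale factor $\beta$; the function names and the toy world utility are hypothetical.

```python
def aristocrat_utility(world, z, agent, moves=(0, 1), weights=None):
    """AU = G(z) - E_f[G | others' moves], with the expectation taken
    over the agent's own moves (uniform f assumed by default)."""
    if weights is None:
        weights = [1.0 / len(moves)] * len(moves)
    expected = sum(
        w * world(z[:agent] + (m,) + z[agent + 1:])
        for w, m in zip(weights, moves)
    )
    return world(z) - expected


def toy_world(z):
    """Hypothetical world utility: best when exactly one agent plays 1."""
    return -(sum(z) - 1) ** 2


au_play_1 = aristocrat_utility(toy_world, (1, 1, 0), agent=0)
au_play_0 = aristocrat_utility(toy_world, (0, 1, 0), agent=0)
print(au_play_1, au_play_0)
```

Because the subtracted term does not depend on the agent's own move, the difference in AU between two of its moves equals the difference in $G$, which is what keeps the utility factored.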

Using AU in practice is sometimes difficult, due to
the need to evaluate the expectation value. Fortunately
there are other utility functions that, while being easier
to evaluate than AU, still are both factored and possess
superior learnability to the team game utility, $g_\eta = G$.
One such private utility function is the Wonderful Life
Utility (WLU). The WLU for agent $\eta$ is parameterized
by a pre-fixed clamping parameter $CL_\eta$ chosen from
among $\eta$'s possible moves:



$$\mathrm{WLU}_\eta \equiv G(z) - G(z_{\hat{\eta}}, CL_\eta). \qquad (6)$$



WLU is factored no matter what the choice of clamping
parameter. Furthermore, while not matching the high
learnability of AU, WLU usually has far better learnability
than does a team game, and therefore (when appropriately
scaled) results in better expected intelligence.
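
Equation 6 translates almost directly into code. The following Python sketch assumes joint moves are tuples and uses a hypothetical toy world utility; the default clamping value of 0 is an illustrative choice, not one prescribed by the text.

```python
def wonderful_life_utility(world, z, agent, clamp=0):
    """WLU = G(z) - G(z with the agent's move replaced by `clamp`)."""
    clamped = z[:agent] + (clamp,) + z[agent + 1:]
    return world(z) - world(clamped)


def toy_world(z):
    """Hypothetical world utility: best when exactly one agent plays 1."""
    return -(sum(z) - 1) ** 2


print(wonderful_life_utility(toy_world, (1, 1, 0), agent=0, clamp=0))
print(wonderful_life_utility(toy_world, (1, 1, 0), agent=0, clamp=1))
```

Unlike AU, only one extra evaluation of $G$ is needed per agent, which is the practical advantage noted above.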



III. COMBINATION OF IMPERFECT OBJECTS 

We now explore the use of collective-based techniques 
for the problem of combining imperfect objects of Chal- 
let and Johnson. The canonical version of this problem 
arises when the objects are all noisy observational devices 
producing a single real number by sampling a Gaussian 
of fixed width centered on the true value of the number. 
The problem is to choose the subset of a fixed collection 
of such devices to have the average (over the members of 
the subset) distortion as close to zero as possible. 

Formally, the problem is to minimize

$$\epsilon = \frac{\left|\sum_{j=1}^{m} n_j a_j\right|}{\sum_{j=1}^{m} n_j}, \qquad (7)$$

where $n_j \in \{0, 1\}$ is whether device $j$ is or is not selected,
and there are $m$ devices in the collection, having
associated distortions $\{a_j\}$. We identify $\epsilon$ with the world
utility, $G$ (so that for these experiments, the goal is to
minimize $G$, not maximize it). There are $m$ individual
agents, each setting one of the $n_j$. The goal is to give
those agents private utilities so that, as they learn to
maximize their private utilities, the optimizer of $G$ is
found.
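
As a sketch of the objective, the following Python code evaluates the average distortion of a selected subset and finds the best subset by exhaustive enumeration, which is feasible only for very small $m$. Treating the empty subset as infinitely bad is an assumption made here for convenience, and the distortion values are hypothetical.

```python
from itertools import product


def world_utility(n, a):
    """Absolute value of the mean distortion over the selected devices.

    n: sequence of 0/1 selections, a: sequence of device distortions.
    Assumption: an empty selection is scored as infinity.
    """
    selected = sum(n)
    if selected == 0:
        return float("inf")
    return abs(sum(nj * aj for nj, aj in zip(n, a))) / selected


def brute_force_best(a):
    """Enumerate all 2**m selections; only practical for small m."""
    return min(product((0, 1), repeat=len(a)), key=lambda n: world_utility(n, a))


distortions = [0.5, -0.25, 1.0]  # hypothetical distortions
best = brute_force_best(distortions)
print(best, world_utility(best, distortions))
```

The enumeration doubles in cost with each added device, which is why brute force is ruled out for large instances.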

Because we wished to concentrate on the effects of the 
utilities rather than on the RL algorithms that use them, 
the agents all used the same (very) simple RL algorithm. 
(We would expect that even marginally more sophisti- 
cated RL algorithms would give better performance.) At 
each timestep each agent runs its algorithm on a data 
set it maintains of action-utility pairs to choose what ac- 
tion to take. This gives a joint action, which in turn sets 
the private utility value for each agent. Combined with 
what action it took at that timestep, that utility value 
for agent j then gets added to the data set maintained by 
agent j. This is done for all agents and then the process 
repeats. 

The agents used their data sets to choose moves by
maintaining a 2-dimensional vector whose components
at a given timestep are the agent's estimates of the utility
it would receive for taking each of its two possible
moves. Each agent j picks its action at a timestep by
sampling a Boltzmann distribution over the "energy
spectrum" of j's two utility estimates at that time.
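
Boltzmann action selection over two utility estimates might look like the Python sketch below. It assumes that a higher utility estimate should be chosen more often (probability proportional to $\exp(u/T)$); the sign convention, the temperature values, and the function names are assumptions, since the text does not spell them out.

```python
import math
import random


def boltzmann_probabilities(estimates, temperature):
    """Probabilities proportional to exp(estimate / temperature).

    The maximum is subtracted before exponentiating for numerical stability.
    """
    top = max(estimates)
    weights = [math.exp((u - top) / temperature) for u in estimates]
    total = sum(weights)
    return [w / total for w in weights]


def choose_action(estimates, temperature, rng=random):
    """Sample an action index from the Boltzmann distribution."""
    probabilities = boltzmann_probabilities(estimates, temperature)
    return rng.choices(range(len(estimates)), weights=probabilities, k=1)[0]


probs = boltzmann_probabilities([0.0, 1.0], temperature=1.0)
action = choose_action([0.0, 1.0], temperature=1.0, rng=random.Random(0))
print(probs, action)
```

A lower temperature concentrates the choice on the action with the higher estimate, while a higher one approaches uniform random selection.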



For simplicity, given how short our runs were, the tem- 
perature in the Boltzmann distribution did not decay in 
time. However, to reflect the fact that the environment
in which an agent is operating changes with time (as the
other agents change their moves), and therefore the optimal
action changes in time, the two utility estimates
are formed using exponentially aged data: for any time
step t, the utility estimate j uses for either of the two
values of its action $n_j$ is a weighted average of all the utility
values it has received at previous times t' that it chose that
action, with the weights in the average given by an
exponential of the values t - t'. Finally, to form the agents'
initial data sets, there is an initialization period in which 
all actions by all agents are chosen uniformly randomly, 
with no learning used. It is after this initialization pe- 
riod ends that the agents choose their actions according 
to the associated Boltzmann distributions. 
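
The exponentially aged utility estimate can be sketched as follows. The exact form of the weights is an assumption: here they are taken as $\exp(-\lambda (t - t'))$ with a hypothetical decay rate $\lambda$ (`decay`), and an action that has never been tried receives a default estimate of 0.

```python
import math


def aged_estimate(history, action, t, decay=0.1):
    """Weighted average of past utilities received for `action`.

    history: list of (time, action, utility) records.
    Weights decay exponentially with the age t - time (assumed form).
    """
    numerator = 0.0
    denominator = 0.0
    for t_prev, a, utility in history:
        if a == action and t_prev < t:
            weight = math.exp(-decay * (t - t_prev))
            numerator += weight * utility
            denominator += weight
    # Assumption: fall back to 0.0 when the action has no recorded data.
    return numerator / denominator if denominator > 0 else 0.0


history = [(0, 1, 1.0), (1, 0, 5.0), (2, 1, 3.0)]
print(aged_estimate(history, action=1, t=3))
print(aged_estimate(history, action=0, t=3))
```

In the example the more recent reward of 3.0 outweighs the older reward of 1.0, so the estimate for action 1 lies above their plain average of 2.0.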

For all learning algorithms, the first 20 time steps con- 
stitute the data set initialization period (note that all 
learning algorithms must "perform" the same during that 
period, since none are actually in use then). Starting at 
t = 20, with each consecutive timestep a fixed fraction 
of the agents still choosing their actions randomly switch 
to using their learner algorithms instead, while others 
continue to take random actions. This gradual introduction
of the learning algorithms is intended to soften the
"discontinuity" in each agent's environment when the
other agents start using their learning algorithms
and therefore change their moves. For m = 100,
four agents turned on their learning algorithms at each 
time step (i.e., by t = 26, all agents were using their 
learners). 

FIG. 1: Combination of Imperfect Objects Problem, m = 100.

Figure 1 shows the convergence properties of different
algorithms in a system with 100 agents. The results
reported are based on 20 different $\{a_j\}$ configurations, each
performed 50 times (i.e., each point on the figure is the
average of 20 × 50 = 1000 runs). G, AU, and WLU show
the performance of agents using reinforcement learners
with those reinforcement signals provided by G (team
game), Aristocrat Utility and Wonderful Life Utility,
respectively. S shows the performance of greedy search
where new $n_j$'s are generated at each step and selected
if the solution is better than the current best solution.
Because the runs are only 200 timesteps long, algorithms 
such as simulated annealing do not outperform simple 
search: there is simply no time for an annealing sched- 
ule. 
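
A baseline in the spirit of the search algorithm S can be sketched as below. How the new $n_j$'s are generated is not specified in the text, so this sketch assumes each step draws a fresh uniformly random selection and keeps it only if it improves on the best found so far; the function names, the seed handling, and the Gaussian distortions are hypothetical.

```python
import random


def world_utility(n, a):
    """Absolute mean distortion of the selected devices (empty set: infinity)."""
    selected = sum(n)
    if selected == 0:
        return float("inf")
    return abs(sum(nj * aj for nj, aj in zip(n, a))) / selected


def random_search(a, steps=200, seed=0):
    """Keep the best of `steps` uniformly random selections (assumed scheme)."""
    rng = random.Random(seed)
    best_n, best_g = None, float("inf")
    for _ in range(steps):
        candidate = [rng.randint(0, 1) for _ in a]
        g = world_utility(candidate, a)
        if g < best_g:
            best_n, best_g = candidate, g
    return best_n, best_g


source = random.Random(1)
distortions = [source.gauss(0.0, 1.0) for _ in range(20)]  # hypothetical devices
best_n, best_g = random_search(distortions, steps=200)
print(best_n, best_g)
```

With a fixed seed, a longer run sees the same candidates as a shorter one plus additional draws, so its result can only match or improve on the shorter run.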

Note that because of the discontinuity in the environ- 
ment experienced by each agent as the other agents turn 
on their learning algorithms, the performance of the sys- 
tem as a whole as the agents turn on their learning may 
degrade initially, before settling down (e.g., WLU). Qual- 
itatively, systems in which agents use the G utility have a 
difficult time learning, while systems in which agents use 
AU perform best and the search algorithm falls between 
the performance of the two. 

FIG. 2: Scaling in the Combination of Imperfect Objects
Problem. 

We also investigated the scaling properties of each
algorithm. Figure 2 shows these results (the t = 200 average
performance over 1000 runs) along with the associated
error bars. As m grows, two competing factors come into
play. On the one hand, there are more degrees of freedom 
to use to minimize G. On the other hand, the problem 
becomes more difficult: the search space gets larger for 
S, and there is more noise in the system for the learning 
algorithms. To account for these effects and calibrate the 
performance values as m varies, in the figure we also pro- 
vide the performance of the algorithm that randomly se- 
lects its action ( "Ran" ). Note that the difference between 
the performances of S and AU increases when the system 
size increases, up to a factor of twenty for m = 1000. 

All algorithms but AU have slopes similar to that of 
"Ran", demonstrating that they cannot use the addi- 
tional degrees of freedom provided by the larger m. Only 
AU effectively uses the new degrees of freedom, provid- 
ing gains that are proportionally higher than the other 
algorithms (i.e., the rate at which AU's performance im- 
proves outpaces what is "expected" based on the random 
algorithm's performance). 

Finally, note that many search algorithms (e.g., gradient
ascent, simulated annealing, genetic algorithms) can
be viewed as collectives. However, conventionally such
algorithms use very "dumb" agents. In particular, in
exploration-exploitation algorithms, the agents typically
make random moves in the exploration step rather than
using RL to choose the best move. Preliminary results indicate
that using RL-based agents to determine the moves in
such exploration steps (intuitively, "agentizing" the
individual variables of a search problem by providing them
with adaptive intelligence) can lead to significantly better
solutions than are achieved with conventional "dumb
variable" exploration steps in a myriad of optimization
problems [10].